Abstract

A new approach is proposed for clustering time-series data. The
approach can be used to discover groupings of similar object motions
that were observed in a video collection. A finite mixture of hidden
Markov models (HMMs) is fitted to the motion data using the
expectation-maximization (EM) framework. Previous approaches for
HMM-based clustering employ a k-means formulation, where each
sequence is assigned to only a single HMM. In contrast, the
formulation presented in this paper allows each sequence to belong to
more than a single HMM with some probability, and the hard decision
about the sequence class membership can be deferred until a later
time when such a decision is required. Experiments with simulated
data demonstrate the benefit of using this EM-based approach when
there is more “overlap” in the processes generating the data.
Experiments with real data show the promising potential of HMM-based
motion clustering in a number of applications.


1.  Introduction 

In the past decade, there has been an explosive growth in the number
of systems that gather and store data about the motion of objects,
machines, vehicles, humans, animals, etc. These data sets are
collected and analyzed for a broad range of applications, too
numerous to mention. The parameterization and dimensionality of the
motion time series data can vary widely, depending on the particular
input device, tracking method, motion model, relevant degrees of
freedom, etc.

Given the size and diversity of these motion data archives, clearly,
general-purpose tools are needed for grouping and organizing the
motion patterns contained therein. For instance, methods for
discovering clusters of similar motion sequences in these data sets
would enable pattern discovery, anomaly detection, modeling,
summarization, etc. Furthermore, knowledge of clusters could be
exploited in data reduction, as well as in efficient methods for
sequence indexing and retrieval.

In this paper, we focus on the problem of finding groups, or
clusters, of similar object motions within a database of motion
sequences, and estimating motion time series models based on these
groups. We employ a probabilistic model-based approach, where the
data is assumed to have been generated by a finite mixture model. In
particular, we assume that the observed sequences are generated by a
finite mixture of hidden Markov models (HMMs), and estimate this
mixture of HMMs via an Expectation Maximization (EM) formulation.

Previous approaches for HMM-based clustering employ a k-means
formulation [11, 13, 16], which has the drawback that in each
iteration, each sequence can only be assigned to a single cluster,
and then only those sequences that are assigned to a particular
cluster are used in the re-estimation of its HMM parameters. This can
lead to problems when there is not a particularly good separation
between the underlying processes that generated the groups of time
series data.

In our proposed EM approach, a sequence can be partially assigned to
all clusters, with the degree of cluster membership determined by the
a posteriori probability, which depends via Bayes' rule on the
cluster a priori probability and the data likelihood. In contrast
with k-means, the parameter estimates of a single HMM cluster are
influenced by all the observation sequences with the corresponding a
posteriori probabilities. The hard decision about the sequence class
membership can be deferred until a later time when such a decision is
required. In experimental evaluation, this EM-based formulation tends
to yield improved accuracy over the k-means approach, particularly
when there is more “overlap” in the processes that generated the
data.

2.  Related  Work 

Methods for clustering time-series data have been proposed recently,
particularly in the statistics and data mining communities (see,
e.g., [4]). In the computer vision community, clustering of motion
data has been used mainly for classification and prediction of
pedestrian trajectories [8, 18], and for event-based analysis of long
video sequences [18, 19].

Methods for clustering sequences or temporal data using hidden Markov
models have been proposed in speech recognition [9], computational
biology [3], and machine learning [16, 11, 13]. In computer vision,
hidden Markov models have been successfully used in the supervised
learning and recognition of specific activities [2] and gestures
[17]; however, to the best of our knowledge, unsupervised learning or
clustering using hidden Markov models has received little or no
attention.

A complete approach to HMM-clustering should address four key
problems [11]: estimating the parameters of a single HMM cluster,
selecting the number of HMM clusters, learning HMM structure
(topology and size), and assigning sequences to clusters.

The fourth problem, which is the focus of this paper, is how to
assign sequences to clusters. In previous approaches [11, 13] the
sequences are assigned in a winner-take-all manner: a sequence is
assigned to the HMM that most likely generated it. In contrast, the
EM-based formulation given in this paper enables soft assignment,
whereby each sequence is partially assigned to a cluster according to
the cluster a posteriori probability given the sequence. This
EM-based algorithm is much preferred in noisy data situations or when
the underlying densities have more “overlap” [10].

3.  Background:  HMMs 

In  this  section,  we  give  a  brief  review  of  hidden  Markov 
models  (HMMs).  Our  main  purpose  is  to  define  notation 
used  in  our  clustering  formulation.  For  a  detailed  overview 
of  HMMs,  readers  are  directed  to  [14]. 

A complete specification of a first-order HMM with a simple Gaussian
observation density is formally given by:

1.  N states, $S = \{S_1, S_2, \ldots, S_N\}$.

2.  The state transition probability distribution $A = \{a_{ij}\}$,
where $a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i)$, $1 \le i, j \le N$.

3.  The observation density $b_j(O_t) = \mathcal{N}(O_t; \mu_j, \Sigma_j)$,
$1 \le j \le N$, where $\mu_j$ and $\Sigma_j$ are the mean and
covariance of the Gaussian of state j.

4.  The initial state probability distribution, $\pi = \{\pi_i\}$,
where $\pi_i = P(q_1 = S_i)$, $1 \le i \le N$,

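where $O_t$ and $q_t$ are the observation and state respectively at
time t. It is common to use the compact notation

$$\lambda = (\pi, A, \{\mu_j\}, \{\Sigma_j\}), \quad 1 \le j \le N, \tag{1}$$

to indicate the complete parameter set of the model.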
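As a concrete illustration of this specification, the following minimal sketch stores the parameter set of a 1-d Gaussian HMM and checks the stochastic constraints implied by items 2 and 4. The class name `GaussianHMM`, its field names, and the `validate` helper are hypothetical and introduced only for this example; the observation density is restricted to scalar observations with a standard deviation per state, which is an assumption made for brevity.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GaussianHMM:
    """Parameter set (pi, A, {mu_j}, {sigma_j}) for scalar observations."""
    pi: List[float]        # pi[i]   = P(q_1 = S_i)
    A: List[List[float]]   # A[i][j] = P(q_{t+1} = S_j | q_t = S_i)
    mu: List[float]        # mean of the Gaussian of each state
    sigma: List[float]     # standard deviation of each state (1-d assumption)

    def validate(self, tol: float = 1e-9) -> None:
        """Raise ValueError if the parameters are not a valid N-state HMM."""
        n = len(self.pi)
        if not (len(self.A) == len(self.mu) == len(self.sigma) == n):
            raise ValueError("all parameter blocks must cover the same N states")
        if abs(sum(self.pi) - 1.0) > tol:
            raise ValueError("pi must sum to 1")
        for row in self.A:
            if len(row) != n or abs(sum(row) - 1.0) > tol:
                raise ValueError("each row of A must be a distribution over N states")
        if any(s <= 0 for s in self.sigma):
            raise ValueError("standard deviations must be positive")


# Illustrative two-state model; the numbers are example values only.
hmm = GaussianHMM(pi=[0.5, 0.5], A=[[0.6, 0.4], [0.4, 0.6]],
                  mu=[0.0, 2.0], sigma=[1.0, 1.0])
hmm.validate()
```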


The problem of estimating the parameters $\lambda^*$ of an HMM given
L independent sequences $O^{(l)}$, $1 \le l \le L$, can be cast as a
maximum likelihood (ML) problem

$$\lambda^* = \arg\max_{\lambda} \prod_{l=1}^{L} P(O^{(l)} \mid \lambda). \tag{2}$$


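Unfortunately, there is no known analytical way for finding the
global ML solution. The well known Baum-Welch algorithm is an
iterative procedure that can only guarantee convergence to a local
maximum. It consists of the following re-estimation formulas [14]:

$$\bar{\pi}_i = \gamma_1(i), \tag{3}$$

$$\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, \tag{4}$$

$$\bar{\mu}_j = \frac{\sum_{t=1}^{T} \gamma_t(j)\, O_t}{\sum_{t=1}^{T} \gamma_t(j)}, \tag{5}$$

$$\bar{\Sigma}_j = \frac{\sum_{t=1}^{T} \gamma_t(j)\,(O_t - \bar{\mu}_j)(O_t - \bar{\mu}_j)^{\top}}{\sum_{t=1}^{T} \gamma_t(j)}, \tag{6}$$

where $\gamma_t(i)$ is the probability of being in state $S_i$ at
time t, given the observation sequence O and the model $\lambda$, and
$\xi_t(i,j)$ is the probability of being in state $S_i$ at time t and
state $S_j$ at time t + 1, given the observation sequence O and the
model $\lambda$. These two variables can be computed by the
forward-backward algorithm [1].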
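Both clustering procedures discussed later need the sequence likelihood $P(O \mid \lambda)$, which is obtained from the forward half of the forward-backward algorithm. The sketch below is one common way to compute its logarithm for a 1-d Gaussian HMM, using per-step rescaling so that long sequences do not underflow. The function names are hypothetical, and the scalar-observation restriction is an assumption of this example rather than something fixed by the text.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a 1-d Gaussian with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def forward_log_likelihood(obs, pi, A, mu, sigma):
    """Return log P(O | lambda) via the scaled forward recursion.

    alpha is renormalized at every time step; the log of each scaling
    factor is accumulated, and their sum equals log P(O | lambda).
    """
    n = len(pi)
    alpha = [pi[i] * gaussian_pdf(obs[0], mu[i], sigma[i]) for i in range(n)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the transition matrix, then weight by b_j(O_t).
            alpha = [
                sum(alpha[i] * A[i][j] for i in range(n))
                * gaussian_pdf(obs[t], mu[j], sigma[j])
                for j in range(n)
            ]
        scale = sum(alpha)  # would be 0 only for observations with zero density
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_lik
```

Working in the log domain here is what later makes the posterior computation in the E-step numerically manageable, since raw likelihoods of sequences with hundreds of observations are far below floating-point range.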


4.  Finite  Mixture  of  HMMs 

In this section we present a specialization of the general framework
for probabilistic model-based clustering, where the cluster models
are HMMs. In HMM-based clustering it is assumed that an observation
sequence O is generated according to a mixture distribution of M
components. Let $P(O \mid \lambda^{(m)})$ be the probability
distribution of the sequence O given the m'th HMM parameterized by
$\lambda^{(m)}$, and let $z_l = (z_{l1}, \ldots, z_{lM})$ be the
cluster membership vector for the l'th sequence, where $z_{lm} = 1$
if sequence $O^{(l)}$ was generated by the m'th HMM and 0 otherwise.
Then, the $z_{lm}$'s can be treated in one of two ways: as fixed but
unknown parameters, or alternatively as missing binary random
variables. In the first case, we want to estimate the set of
parameters $\theta = \{\lambda^{(m)}, z_l \mid 1 \le m \le M,\ 1 \le l \le L\}$
that maximize the likelihood function

$$\mathcal{L}_1(\theta) = \prod_{l=1}^{L} \prod_{m=1}^{M} P(O^{(l)} \mid \lambda^{(m)})^{z_{lm}}. \tag{7}$$

Here each $\lambda^{(m)}$ is written in the compact notation
$\lambda = (\pi, A, \{\mu_j\}, \{\Sigma_j\})$, $1 \le j \le N$ (Eq. 1),
which indicates the complete parameter set of the model.
In the second case, each $z_{lm}$ has a prior probability $p^{(m)}$
of being 1, that is, of sequence $O^{(l)}$ being generated by the
m'th HMM. In this case we want to estimate the set of parameters
$\theta = \{\lambda^{(m)}, p^{(m)} \mid 1 \le m \le M\}$ that
maximize the likelihood function

$$\mathcal{L}_2(\theta) = \prod_{l=1}^{L} \sum_{m=1}^{M} p^{(m)} P(O^{(l)} \mid \lambda^{(m)}). \tag{8}$$

It is well known that ML estimates cannot be found analytically for
either likelihood function, and one must resort to iterative
procedures. K-means is an iterative procedure for finding the
estimates for the first likelihood function (Eq. 7). In k-means, each
iteration consists of two steps:

•  Assign sequences to clusters $\lambda^{(m)}$:

$$z_{lm} = \begin{cases} 1 & \text{if } m = \arg\max_{m'} P(O^{(l)} \mid \lambda^{(m')}) \\ 0 & \text{otherwise.} \end{cases} \tag{9}$$

•  Update estimates $\lambda^{(m)}$
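The update step re-estimates each HMM from the sequences currently assigned to it. As a hedged sketch, the standard way to do this is to pool the Baum-Welch statistics of all sequences, weighting sequence $l$ by $z_{lm}$. The symbols $\gamma_t^{(l)}$, $\xi_t^{(l)}$ (the state posteriors of sequence $O^{(l)}$ computed under the current $\lambda^{(m)}$) and $T_l$ (the length of sequence $l$) are notation assumed for this illustration, and the formulas below show the generic weighted form rather than a verbatim statement of Eqs. (10)-(13).

```latex
\begin{align*}
\bar{\pi}_i^{(m)} &= \frac{\sum_{l=1}^{L} z_{lm}\, \gamma_1^{(l)}(i)}{\sum_{l=1}^{L} z_{lm}}, &
\bar{a}_{ij}^{(m)} &= \frac{\sum_{l=1}^{L} z_{lm} \sum_{t=1}^{T_l-1} \xi_t^{(l)}(i,j)}
                          {\sum_{l=1}^{L} z_{lm} \sum_{t=1}^{T_l-1} \gamma_t^{(l)}(i)}, \\[4pt]
\bar{\mu}_j^{(m)} &= \frac{\sum_{l=1}^{L} z_{lm} \sum_{t=1}^{T_l} \gamma_t^{(l)}(j)\, O_t^{(l)}}
                         {\sum_{l=1}^{L} z_{lm} \sum_{t=1}^{T_l} \gamma_t^{(l)}(j)}, &
\bar{\Sigma}_j^{(m)} &= \frac{\sum_{l=1}^{L} z_{lm} \sum_{t=1}^{T_l} \gamma_t^{(l)}(j)\,
                              \bigl(O_t^{(l)} - \bar{\mu}_j^{(m)}\bigr)\bigl(O_t^{(l)} - \bar{\mu}_j^{(m)}\bigr)^{\top}}
                            {\sum_{l=1}^{L} z_{lm} \sum_{t=1}^{T_l} \gamma_t^{(l)}(j)}.
\end{align*}
```

With binary $z_{lm}$ this reduces to ordinary multi-sequence Baum-Welch on the members of cluster $m$; with fractional weights every sequence contributes to every cluster in proportion to its weight.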
EM is an iterative procedure for finding the ML estimates for the
mixture likelihood function (Eq. 8). Similar to the k-means
algorithm, each iteration consists of two steps:

•  E-Step

$$w_{lm} = P(z_{lm} = 1 \mid O^{(l)}, \theta) = \frac{p^{(m)} P(O^{(l)} \mid \lambda^{(m)})}{\sum_{m'=1}^{M} p^{(m')} P(O^{(l)} \mid \lambda^{(m')})}, \tag{14}$$

where the last equality follows from Bayes law, $p^{(m)}$ is the a
priori probability that $z_{lm} = 1$, and $w_{lm}$ is the posterior
probability that $z_{lm} = 1$ after observing $O^{(l)}$.

•  M-Step

$$p^{(m)} = \frac{\sum_{l=1}^{L} w_{lm}}{\sum_{m'=1}^{M} \sum_{l=1}^{L} w_{lm'}} = \frac{\sum_{l=1}^{L} w_{lm}}{L}, \tag{15}$$

and replace $z_{lm}$ with $w_{lm}$ in Eqs. (10)-(13) throughout. We
note that the computation of the posterior probabilities in the
E-step (Eq. 14) requires special arithmetic manipulations to avoid
numerical problems. Otherwise, the computation cost of EM and k-means
is similar.
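One common form of the arithmetic manipulation mentioned above is to work with log-likelihoods and subtract the largest score before exponentiating (the log-sum-exp trick). The sketch below applies it to the E-step posterior for one sequence, and also shows the k-means hard assignment and the mixing-weight update for comparison. The function names are hypothetical, priors are assumed strictly positive, and this is an illustration of the technique rather than the specific manipulation used in the paper's implementation.

```python
import math


def soft_posteriors(log_lik_row, priors):
    """E-step weights w_m for one sequence.

    log_lik_row[m] = log P(O | lambda^(m)); priors[m] = p^(m) > 0.
    Subtracting the maximum score keeps exp() within floating-point range.
    """
    scores = [math.log(p) + ll for p, ll in zip(priors, log_lik_row)]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def hard_assignment(log_lik_row):
    """k-means assignment: one-hot vector at the most likely HMM."""
    best = max(range(len(log_lik_row)), key=log_lik_row.__getitem__)
    return [1 if m == best else 0 for m in range(len(log_lik_row))]


def update_priors(w):
    """M-step for the mixing weights: p^(m) = (1/L) * sum_l w[l][m]."""
    num_sequences = len(w)
    return [sum(row[m] for row in w) / num_sequences for m in range(len(w[0]))]


# Log-likelihoods this small would underflow to 0.0 if exponentiated directly.
print(soft_posteriors([-1000.0, -1001.0], [0.5, 0.5]))
print(hard_assignment([-1000.0, -1001.0]))
```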


5.  Model  Selection 

In the previous section, we have assumed that the number of mixture
components M is known. The problem of estimating the “correct”
number of clusters is a difficult one: a full Bayesian solution for
obtaining the posterior probability on M requires a complex
integration over the HMM parameter space, as well as knowledge about
the priors on the mixture parameters and about the priors on M
itself. Often this integration cannot be solved in closed form, and
Monte-Carlo methods and other approximation methods are used to
evaluate it. However, these methods are computationally intensive.
Other methods, which sacrifice some accuracy for efficiency, are the
penalized likelihood approaches, where the log-likelihood term is
penalized by subtraction of a complexity term. We use such a method
that tries to find a model with Minimum Description Length (MDL)
[15]. Assuming all the models are equally likely a priori we can
write:

$$\log P(M \mid O) \approx \log P(O \mid M, \hat{\theta}) - \frac{d}{2} \log L, \tag{16}$$

where $P(M \mid O)$ is the approximate posterior distribution on M,
$P(O \mid M, \hat{\theta})$ is the data likelihood term (Eq. 7 for
k-means and Eq. 8 for EM) given the ML estimates $\hat{\theta}$, and
$\frac{d}{2} \log L$ is the MDL term, where $d = M + |\theta|$ is the
number of model parameters.
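The criterion in Eq. 16 can be sketched as a small scoring routine that is evaluated once per candidate M after fitting. The function names and the input layout (a mapping from M to the fitted log-likelihood and the HMM parameter count) are hypothetical and chosen only for illustration; the numbers in the usage example are made up.

```python
import math


def mdl_score(log_likelihood, num_clusters, num_hmm_params, num_sequences):
    """Penalized log-likelihood: log P(O | M, theta) - (d / 2) * log L,
    with d = M + |theta| as in Eq. 16."""
    d = num_clusters + num_hmm_params
    return log_likelihood - 0.5 * d * math.log(num_sequences)


def select_num_clusters(candidates, num_sequences):
    """candidates maps M -> (log-likelihood at the ML fit, |theta|).

    Returns the M with the highest penalized score.
    """
    return max(
        candidates,
        key=lambda m: mdl_score(candidates[m][0], m, candidates[m][1], num_sequences),
    )


# Made-up example: the likelihood gain from M = 2 to M = 3 is too small
# to pay for the extra parameters, so M = 2 is selected.
fits = {1: (-150.0, 5), 2: (-100.0, 10), 3: (-99.0, 15)}
print(select_num_clusters(fits, num_sequences=100))
```

Because the penalty grows only with $\log L$ while the likelihood term grows with the total number of observations, long sequences can make the likelihood term dominate, which is the behavior reported for the real-data experiments.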

6.  Experimental  Evaluation 

The two HMM-based clustering algorithms (EM and k-means) described in
Section 4 were implemented in Matlab using an HMM toolbox [12]. The
measure used for testing the validity of our clustering results is
classification accuracy. The reason is that for all our experiments
ground-truth was available, so the majority class of each cluster
could be associated with the cluster itself, which enabled the
computation of classification accuracy. When ground-truth is not
available, or when it is not known whether the data points can be
naturally clustered, other validity measures should be employed [7].
The purpose of the experiments with simulated data is to compare
performance of the k-means and the EM-based approaches for clustering
sequences with HMMs. The purpose of the experiments with real data is
to demonstrate the usefulness of HMM-based clustering in a number of
applications.
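The majority-class accuracy measure described above can be sketched as follows: each cluster is labeled with its most frequent ground-truth class, and a sequence counts as correct if its own class matches the label of its cluster. The function name and argument layout are hypothetical, and ties between classes inside a cluster are broken arbitrarily in this sketch.

```python
from collections import Counter


def clustering_accuracy(cluster_ids, true_labels):
    """Fraction of items whose true label equals the majority label
    of the cluster they were assigned to."""
    by_cluster = {}
    for cluster, label in zip(cluster_ids, true_labels):
        by_cluster.setdefault(cluster, Counter())[label] += 1
    # The majority count of each cluster is the number of correct items in it.
    correct = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return correct / len(true_labels)


# Cluster 0 holds {a, a, b} and cluster 1 holds {b, b}: 4 of 5 are correct.
print(clustering_accuracy([0, 0, 0, 1, 1], ["a", "a", "b", "b", "b"]))
```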

6.1  Experiment  with  Synthetic  Data 

In this experiment, 100 sequences of length 200 are generated from a
2-component HMM mixture (50 sequences from each component). Both HMMs
are modeled with two states, and a 1-d Gaussian observation density,
in a manner similar to [16]. Using Dynamic Time Warping (DTW) for
initial clustering, both models were initialized in the standard way:
uniform priors, uniform transition matrices, and means and variances
estimated using k-means. In this experiment the amount of overlap
between the generating models was varied by varying the mean
separation between the Gaussian outputs of the two states, in a
similar way for both HMMs, leaving all the other parameters fixed. To
avoid confusion between states and HMM clusters, we use superscripts
to denote HMM clusters, and subscripts to denote states.

Method    |  Separation
          |  1.00  1.25  1.50  1.75  2.00  2.25  2.50  2.75  3.00
----------+-------------------------------------------------------
True      |     0     0     3    28    44    50    50    50    50
EM        |     0     0     0     1    22    46    49    47    47
k-means   |     0     0     0     0    21    29    30    44    49

Table 1.  EM vs. k-means classification accuracy. Each entry in the
table is the number of trials for which the method achieved the
specified classification accuracy as a function of the separation
between the observation Gaussian densities.
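The synthetic setup described in this section could be simulated along the following lines. The transition matrices, unit standard deviations, and first-state mean of 0 follow the values given for Experiment 1 (Eq. 17 and the surrounding text); the uniform initial-state distribution, the function names, and the seeding scheme are assumptions of this sketch, since the text does not state them.

```python
import random


def sample_hmm(pi, A, mu, sigma, length, rng):
    """Draw one observation sequence from a 1-d Gaussian HMM."""
    states = range(len(pi))
    state = rng.choices(states, weights=pi)[0]
    obs = []
    for _ in range(length):
        obs.append(rng.gauss(mu[state], sigma[state]))
        state = rng.choices(states, weights=A[state])[0]
    return obs


def make_dataset(separation, n_per_component=50, length=200, seed=0):
    """Generate sequences from a 2-component HMM mixture with true labels."""
    rng = random.Random(seed)
    pi = [0.5, 0.5]                      # assumption: not specified in the text
    mu, sigma = [0.0, separation], [1.0, 1.0]
    dynamics = [
        [[0.6, 0.4], [0.4, 0.6]],        # A^(1), from Eq. 17
        [[0.4, 0.6], [0.6, 0.4]],        # A^(2), from Eq. 17
    ]
    sequences, labels = [], []
    for label, A in enumerate(dynamics):
        for _ in range(n_per_component):
            sequences.append(sample_hmm(pi, A, mu, sigma, length, rng))
            labels.append(label)
    return sequences, labels


sequences, labels = make_dataset(separation=2.0)
print(len(sequences), len(sequences[0]))
```

Note that the two components share the same observation densities and differ only in their dynamics, so the separation parameter controls how reliably the hidden state, and hence the transition behavior, can be read off the observations.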


6.1.1  Experiment 1: Sensitivity to Observation Noise

In this experiment the HMMs' dynamics were

$$A^{(1)} = \begin{pmatrix} 0.6 & 0.4 \\ 0.4 & 0.6 \end{pmatrix}, \qquad A^{(2)} = \begin{pmatrix} 0.4 & 0.6 \\ 0.6 & 0.4 \end{pmatrix}, \tag{17}$$

and for both HMMs the standard deviations of the Gaussians were kept
fixed, $\sigma_1 = \sigma_2 = 1$, the mean of the first state was
kept fixed, $\mu_1 = 0$, and the mean of the second state (for both
HMMs) varied in the range $\mu_2 = (1.00, \ldots, 3.00)$, a total of
nine values. This corresponds to a change in the normalized mean
separation between the two Gaussians. For each of the nine separation
values, 50 trials were run, and for each trial, classification
accuracy was computed. Model selection was applied only to the data
generated in the 50 trials corresponding to mean separation 3.00. In
46 out of the 50 trials the EM algorithm correctly found 2 HMM
clusters. The k-means algorithm found the correct number of clusters
in all of the trials. We instantiated all other trials with a
two-component HMM mixture.

The results are summarized in Table 1. Each entry in Table 1 is the
number of trials (out of 50) for which the method achieved 90%
classification accuracy. Classification results are given for the
k-means and EM methods, and for reference we give the classification
accuracy of the “true” HMMs that generated the sequences. These are
the optimal classifiers, and their performance gives an upper bound
on the possible classification accuracy.

The results from Table 1 indicate that when the separation between
the Gaussians in state 1 and 2 is large enough (3.00), both k-means
and EM achieve near optimal classification accuracy. When the
separation between the Gaussians is very small (< 2.00), both
approaches perform badly as expected, as the Gaussians become
practically indistinguishable. Between these two extremes EM seems to
outperform k-means (for example, 46 versus 29 trials at separation
2.25, and 49 versus 30 trials at separation 2.50).

6.2  Experiments with Real Data

6.2.1  Experiment 2: Camera Mouse


In this experiment, we cluster 2D time-series data obtained via the
Camera Mouse system [5]. As shown in Fig. 1(a), a correlation-based
video tracking subsystem estimates the image position of a selected
facial feature. In these experiments, the tip of the user’s nose was
tracked. Motion of the user’s head/nose then drives an onscreen
cursor, and thereby enables the user to control a hierarchical
spelling interface, as shown in Fig. 1(b). In the top-level menu of
the hierarchical spelling interface, the alphabet is divided into
five sub-alphabets. By moving the cursor, the user selects the
sub-alphabet that contains the desired letter, and then a second menu
appears that allows the user to pick the desired letter from the
sub-alphabet.

For this particular experiment, the subjects used the Camera Mouse
system to spell the words: “athens”, “berlin”, “london”, “boston”,
and “paris”. The numbers of sequences obtained for these words were
3, 4, 4, 5, and 4 respectively, yielding a total of 20 sequences. The
average length of a sequence is around 1100. The shortest sequence
length is 834, and the longest one is 1719. Figure 2 shows four
sequences: two corresponding to the word “berlin”, and two
corresponding to the word “london”.

Figure 1.  Spelling via the CameraMouse interface. (a) The
correlation-based video tracking system estimates the image position
of a selected facial feature. In these experiments, the tip of the
user’s nose was tracked. (b) Motion of the user’s nose then moves the
cursor on the screen to spell words via a hierarchical spelling
interface as described in the text.

The model selection criterion was tested with values of M varying in
the range between 1 and 6. Since the sequences were relatively long,
the likelihood term dominated the MDL term, and the maximum value of
6 was selected as the “correct” number of clusters for both the EM
and k-means algorithms. The sequences corresponding to the word
“boston” were distributed among two clusters by both algorithms. In
what follows, we chose to ignore the estimated number of clusters and
initialized the algorithms with the true number of classes, 5.

Two different initializations were tested: (1) initialization with
DTW (in a manner similar to [13]), and (2) random initialization.
With DTW, 16 out of the 20 sequences were initially correctly
classified, and using this as initial classification, both
HMM-clustering approaches were able to move one more sequence
(corresponding to the word “boston”) to its correct class, so 17 out
of 20 were eventually correctly classified. With random
initialization only 10 out of the 20 sequences were initially
correctly classified, and after HMM-clustering 16 out of the 20
sequences were correctly classified. We noticed that EM and k-means
performed essentially the same, because the sequence lengths were
long enough to cause the likelihood ratios to dominate the prior
probability ratios, and therefore to cause the EM soft posteriors to
approach the k-means hard posteriors.
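The effect of sequence length on the posteriors can be made concrete with a short worked calculation. For two clusters, the posterior odds are the prior odds times the likelihood ratio. The quantity $\delta$ below, the average per-observation log-likelihood advantage of cluster 1 over cluster 2, and its value of 0.01 are hypothetical and chosen only for illustration; $T = 1100$ is the average sequence length reported above.

```latex
\begin{align*}
\frac{w_{l1}}{w_{l2}}
  &= \frac{p^{(1)}}{p^{(2)}} \cdot
     \frac{P(O^{(l)} \mid \lambda^{(1)})}{P(O^{(l)} \mid \lambda^{(2)})}
   = \frac{p^{(1)}}{p^{(2)}} \, e^{T\delta}, \\[4pt]
\text{with } p^{(1)} = p^{(2)},\ T = 1100,\ \delta = 0.01:\quad
\frac{w_{l1}}{w_{l2}} &= e^{11} \approx 5.99 \times 10^{4},
\qquad
w_{l1} = \frac{1}{1 + e^{-11}} \approx 1 - 1.7 \times 10^{-5}.
\end{align*}
```

Even a very small per-observation advantage therefore yields a posterior that is numerically indistinguishable from a hard assignment, and changing the prior odds by a modest factor does not alter this.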


Figure 2.  Four tracking sequences: two corresponding to the word
“berlin” (blue dashed), and two corresponding to the word “london”
(red solid). The graph depicts the x and y position of the user’s
nose.


6.2.2  Experiment  3:  Georgia-Tech  Gait  database 

We conducted experiments with gait data that was collected for the
Human ID project at the Computational Perception Lab at Georgia Tech.
This database is available online at
http://www.cc.gatech.edu/cpl/projects/hid/. This database consists of
video sequences of 20 walking human subjects taken under various
viewing conditions. We used a subset of this database for our
experiment: a total of 45 sequences of 15 subjects (3 sequences per
subject), for which binary masks, extracted using a background
subtraction algorithm, were available.

All sequences were taken under the same viewing conditions. Fig. 3
shows example frames of two out of the 45 sequences pertaining to
subjects 3 and 5. Binary masks are shown below their corresponding
frames. For each binary mask we computed the aspect ratio of its
bounding box, and thus for each of the 45 sequences we obtained an
aspect ratio time-series. An average smoothing filter of size five
was applied to the time-series. The resulting smoothed aspect ratio
time-series for subjects 3 and 5 are depicted in Fig. 4.
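The feature extraction described above can be sketched in a few lines. The mask representation (a list of rows of 0/1 values), the function names, the width-over-height convention for the aspect ratio, and the shrinking window at the sequence boundaries are all assumptions of this example; the text does not specify them.

```python
def aspect_ratio(mask):
    """Width / height of the bounding box of the nonzero pixels.

    mask is a list of rows of 0/1 values and must contain at least one
    nonzero pixel. The width/height orientation is an assumption.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, value in enumerate(row) if value]
    height = rows[-1] - rows[0] + 1
    width = max(cols) - min(cols) + 1
    return width / height


def moving_average(series, size=5):
    """Centered average filter; the window shrinks near the boundaries."""
    half = size // 2
    smoothed = []
    for t in range(len(series)):
        window = series[max(0, t - half): t + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Toy silhouette: 2 pixels wide and 3 pixels tall.
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
print(aspect_ratio(mask))
print(moving_average([1.0, 2.0, 3.0, 4.0, 5.0]))
```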

The model selection criterion was tested with 5, 10, and 15 clusters,
where 15 is the true number of classes. The EM algorithm selected the
value M = 5, and the k-means algorithm selected M = 10. Model
selection underestimated the number of clusters because the aspect
ratio is not the most discriminative feature, and in such cases MDL
will tend to simplify the model even more. In what follows, we chose
to ignore the estimated number of clusters and initialized the
algorithms with the true number of classes, 15.

Figure 3.  Example frames and corresponding silhouettes extracted
from image sequences pertaining to subjects 3 and 5.

In order to reduce the sensitivity of the HMM-clustering algorithm to
the initial grouping, we tried two initializations: Smyth’s method
[16], and dynamic time warping (DTW). Both methods use agglomerative
hierarchical clustering, and the clusters are formed by cutting the
hierarchy at level 15. Smyth’s method uses the KL divergence as the
dissimilarity measure, while DTW uses the warping distance as the
dissimilarity measure. The results of the agglomerative algorithms
were used as initial clusterings, and were refined using the two HMM
partition-based clustering algorithms, namely k-means and EM.
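For reference, the warping distance used as the DTW dissimilarity can be computed with the classic dynamic-programming recursion. The sketch below handles scalar sequences with an absolute-difference local cost and no warping-window constraint; these choices, and the function name, are assumptions of the example, since the text does not state which DTW variant was used.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two scalar sequences.

    cost[i][j] is the cheapest alignment of a[:i] with b[:j], where each
    step may advance in a, in b, or in both.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # repeat b[j-1]
                                     cost[i][j - 1],      # repeat a[i-1]
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]


# A time-stretched copy aligns at zero cost.
print(dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0]))
```

A matrix of such pairwise distances can then be passed to any agglomerative clustering routine, and the resulting hierarchy cut at the desired number of clusters.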

The input time-series depicted in Fig. 4 can be modeled as sine waves
plus noise. We therefore selected a periodic HMM structure that
consists of four states corresponding to the sine valley, zero
crossing up, peak, and zero crossing down. This is a more compact
representation than the one typically used to represent periodic
signals with HMMs [6, 13]. The typical model consists of as many
states as observations in a single period. The features we used are
the aspect ratio and its first derivative.
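One plausible way to encode such a periodic structure is a cyclic transition matrix in which each state either persists or advances to the next phase, with the last state wrapping around to the first. The text does not give the transition structure in detail, so the self-loop-plus-advance topology, the function name, and the value of `stay_prob` below are assumptions made for illustration.

```python
def cyclic_transition_matrix(n_states=4, stay_prob=0.8):
    """Transition matrix for a periodic HMM: state i either stays in i
    or moves to (i + 1) mod n_states; all other transitions are zero."""
    A = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        A[i][i] = stay_prob
        A[i][(i + 1) % n_states] = 1.0 - stay_prob
    return A


# States in order: valley, zero crossing up, peak, zero crossing down.
for row in cyclic_transition_matrix():
    print(row)
```

Since Baum-Welch re-estimation keeps zero transition probabilities at zero, initializing with such a matrix preserves the cyclic topology during training while the nonzero entries are free to adapt.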

The results of this experiment are summarized in Table 2. The results
indicate that the EM and k-means procedures improve the initial
clusters obtained from DTW. Smyth’s procedure yields good initial
clusters. EM and k-means assigned a few more sequences to clusters
with similar sequences, thus reducing the number of clusters and
consequently the classification accuracy. The classification
accuracies of the clustering algorithms (69%) are very similar to the
classification accuracy obtained using supervised learning (76%). In
other words, the clustering-based classifier achieved comparable
performance, but without the need for class labelings provided in the
supervised learning approach. This is an encouraging result.

Figure 4.  Smoothed aspect ratio signals for subjects 3 (blue dashed)
and 5 (red solid).

Setting                       Method               Classification
                                                   Accuracy (%)
-----------------------------------------------------------------
Supervised                    1-NN leave-one-out   76
Unsupervised, DTW init.       DTW init.            51
                              K-means              69
                              EM                   69
Unsupervised, Smyth’s init.   Smyth’s init.        71
                              K-means              69
                              EM                   69

Table 2.  Classification results for Georgia-Tech gait database.

7.  Conclusion  and  Future  Work


In this paper we presented an EM-based algorithm for clustering
sequences using a mixture of HMMs, and compared it to the k-means
algorithm typically used in this context. Our experiments show that
though the EM and k-means approaches generally have similar
performance, the EM-based approach produces better estimates when
there is more “overlap” in the underlying densities.

In future work, we plan to improve the clustering algorithm by
incorporating a robust method to handle outliers, and a method for
learning the structure of the HMMs. Our main goal then is to make use
of motion clusters for efficient indexing and retrieval of similar
motion sequences.